class: center, middle, inverse, title-slide

# Applied Mathematics in Industry from a Data Scientist’s Perspective
## One or Two Things I Wish I Had Learned In School
### Jay Lee @AT&T
### 2021/02/10

---
class: inverse, center, middle

# Views my own, not of employer

---

# Introduction

--

- Where (Role) `-->` Tool(s)

--

- US Army (Automated Logistical Specialist) `-->` Database (Data Entry)

--

- UNC (BS in Mathematical Decision Science) `-->` Matlab

--

- New York Life Insurance (Actuarial Intern) `-->` Excel (Shortcut), Database (Query)

--

- Georgia Tech (MS/PhD in Industrial Engineering) `-->` Matlab

--

- EPA (Physical Scientist Intern) `-->` Database (MS Access)

--

- UPS (Security Analyst in Corp. Security) `-->` R (Plotting), Database (Data Warehouse)

--

- AT&T (Data Scientist in Chief Data Office) `-->` R (Packaging), Python, Big Data

???

- views are my own, not of employer
- go over career journey since high school with focus on math-related background
- have walk-thru handy for each tool
  - OS/shell/vim
  - R/python
  - workflow

---

# Motivation

--

.right-plot[
<img src="data:image/png;base64,#../img/cran-home.png" width="100%" />
]

- [CRAN](https://cran.r-project.org/)

--

- install.packages(["ggplot2"](https://ggplot2.tidyverse.org/))

--

- R package development workshop in 2017

--

- [uncmbb](https://github.com/joongsup/uncmbb) package on CRAN

--

- [The Carpentries](https://software-carpentry.org/lessons/)

--

- Things I wish I had learned in school

--

- Some didn't exist, but mostly I just didn't know better

--

- Introductory by design, not comprehensive

???

- some topics I didn't care much about, others were not available yet (while I was in school)

---
class: inverse, center, middle

# [Survey Result](https://www.surveymonkey.com/stories/SM-RH767LF2/)

---

# Operating System

--

- Mainly for Windows users

--

- Know there are other [operating systems](https://en.wikipedia.org/wiki/Operating_system)

--

- Local Machine (e.g., your computer) vs.
Remote Server (e.g., school computing server)

--

- Play with other operating systems (mainly [Linux](https://en.wikipedia.org/wiki/Linux))

--

- There are many [flavors](https://en.wikipedia.org/wiki/List_of_Linux_distributions) of Linux, but don't be discouraged! ([Ubuntu](https://ubuntu.com/download/desktop) is just fine)

--

- Windows Subsystem for Linux ([WSL](https://docs.microsoft.com/en-us/windows/wsl/about))

--

- Try ([Virtual Box](https://www.virtualbox.org/), [USB boot](https://ubuntu.com/tutorials/create-a-usb-stick-on-windows#1-overview))

???

- not an expert, but to bring it up to attention
- not saying one OS is necessarily better than the others
- but esp Windows users, be aware and test drive other OS, starting w/ any Linux
- only at AT&T, started using non-Windows OS
- meaning I've been a Windows user exclusively for the majority of my life
- not b/c I loved it, but b/c I didn't know much about alternatives
- not saying it's wrong
- who knows, I might go back to Windows in 10 years
- depends on employer's choice
- open source aspect
- with [Docker](https://www.docker.com/resources/what-container), no need to worry about which OS to use

---

# Shell

--

- [Terminal](https://en.wikipedia.org/wiki/Terminal_emulator)

--

  - Really, a terminal *emulator*

--

  - A graphical window

--

  - Lets you interact with your operating system through shell

--

- [Shell](https://swcarpentry.github.io/shell-novice/01-intro/index.html)

--

  - Command line interface (CLI)

--

  - Scripting/programming language

--

  - Bash ("**B**ourne **a**gain **sh**ell") is default for many OS

--

- Terminal
--
 `-->` Shell
--
 `-->` Operating System

--

- Files, files, and more files

--

  - Project directory structure

--

- Easier in action than in text

???
- shell: a thing/program/interface that lets you interact w/ operating system
- many different types
- I'm using default (bash)

---

# Text Files

--

- Most work in shell is text-based

--

- Get used to working with plain text files

--

- A variety of text editors

--

  - [Vim](https://en.wikipedia.org/wiki/Vim_(text_editor))
  - [Emacs](https://en.wikipedia.org/wiki/Emacs)
  - [Notepad/Notepad++](https://notepad-plus-plus.org/)
  - [Visual Studio Code](https://code.visualstudio.com/)
  - [Sublime](https://www.sublimetext.com/)
  - [RStudio](https://rstudio.com/)
  - [And more](https://en.wikipedia.org/wiki/Text_editor)

--

- Pick a text editor and try using it for any text-based tasks

--

  - Coding
  - [Note taking](https://github.com/vimwiki/vimwiki)
  - [Presentation](https://github.com/yihui/xaringan)

--

- How to write in a text editor?
--
 `-->` check out [R Markdown](https://rmarkdown.rstudio.com/)

???

- not saying one's better than others
- personally using Vim for the last couple of years

---

# Languages of Data Science

--

- [R](https://www.r-project.org/) or [Python](https://www.python.org/)?
--
 Both!

--

  - "R is a language and environment for statistical computing and graphics"

--

  - "Python is a programming language that lets you work quickly and integrate systems more effectively"

--

- Plotting

--

  - Bar chart

--

  - Line chart

--

  - Covers majority of plotting needs

--

- Packaging

--

  - [R Package](https://r-pkgs.org/)

--

  - [Python Package](https://py-pkgs.org/)

--

  - Start w/ data package ([babynames](https://cran.r-project.org/web/packages/babynames/index.html), [uncmbb](https://cran.r-project.org/web/packages/uncmbb/index.html))

--

- And everything between plotting and packaging

???
- then, finally, data science languages
- where it all started for me at the RStudio conference in Orlando, FL, in 2017
- "R is more specific and Python is more general"
- packaging is a process of putting commonly used code and documentation together

---

# Data Example

```r
#install.packages("uncmbb") # if not already installed
library(uncmbb)
tail(unc)
```

```
##      Season  Game_Date Game_Day   Type Where      Opponent_School Result Tm Opp
## 2256   2020 2020-02-25      Tue    REG     H North Carolina State      W 85  79
## 2257   2020 2020-02-29      Sat    REG     A             Syracuse      W 92  79
## 2258   2020 2020-03-03      Tue    REG     H          Wake Forest      W 93  83
## 2259   2020 2020-03-07      Sat    REG     A                 Duke      L 76  89
## 2260   2020 2020-03-10      Tue CTOURN     N        Virginia Tech      W 78  56
## 2261   2020 2020-03-11      Wed CTOURN     N             Syracuse      L 53  81
##        OT
## 2256 <NA>
## 2257 <NA>
## 2258 <NA>
## 2259 <NA>
## 2260 <NA>
## 2261 <NA>
```

```r
tail(duke)
```

```
##      Season  Game_Date Game_Day Type Where      Opponent_School Result  Tm Opp
## 2253   2020 2020-02-19      Wed  REG     A North Carolina State      L  66  88
## 2254   2020 2020-02-22      Sat  REG     H        Virginia Tech      W  88  64
## 2255   2020 2020-02-25      Tue  REG     A          Wake Forest      L 101 113
## 2256   2020 2020-02-29      Sat  REG     A             Virginia      L  50  52
## 2257   2020 2020-03-02      Mon  REG     H North Carolina State      W  88  69
## 2258   2020 2020-03-07      Sat  REG     H       North Carolina      W  89  76
##        OT
## 2253 <NA>
## 2254 <NA>
## 2255  2OT
## 2256 <NA>
## 2257 <NA>
## 2258 <NA>
```

---

# Bar Chart Example

.left-code[
```r
library(uncmbb)
library(dplyr)
library(ggplot2)

*# prepare data for plotting
dat <- unc %>%
  filter(Season >= 2005) %>%
  group_by(Result) %>%
  summarize(games = n())

*# plot aggregated data
dat %>%
  ggplot(aes(x = Result, y = games)) +
  geom_bar(stat = "identity") +
  labs(title = "UNC Win/loss since 2005")
```
]

.right-plot[
<img src="data:image/png;base64,#things-I-wish-I-had-learned-in-school_files/figure-html/unnamed-chunk-1-1.png" width="504" />
]

---

# Line Chart Example

.left-code[
```r
library(uncmbb)
library(dplyr)
library(ggplot2)

*# prepare data for plotting
dat <- unc %>%
  filter(Season >= 2005) %>%
  group_by(Season) %>%
  summarize(games = n(),
            wins = sum(Result == "W"),
            losses = sum(Result == "L"),
            win_pct = wins/games)

*# plot aggregated data
dat %>%
  ggplot(aes(x = Season, y = win_pct, group = 1)) +
  geom_line() +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  geom_hline(yintercept = 0.5, linetype = "dashed", colour = "red") +
  scale_y_continuous(labels = scales::percent) +
  labs(title = "UNC Win % since 2005")
```
]

.right-plot[
<img src="data:image/png;base64,#things-I-wish-I-had-learned-in-school_files/figure-html/unnamed-chunk-2-1.png" width="504" />
]

---

# Data Science Workflow

--

.right-plot[
<img src="data:image/png;base64,#../img/rp-overview.jpg" width="100%" />
]

- Example data science workflow ([source](https://cacm.acm.org/blogs/blog-cacm/169199-data-science-workflow-overview-and-challenges/fulltext))

--

- Missing, but important: **Problem Formulation**

--

- Iterative in nature

--

- Emphasis on "Analysis" step in school

--

- More emphasis on other steps in industry

--

- Team sport

--

  - Team lead

--

  - Project managers

--

  - Data engineers

--

  - Data scientists

---

# Parting Thoughts

--

- In a nutshell, try

--

  - Ubuntu

--

  - Bash shell

--

  - Text editor

--

  - Bar/line charts in R/Python

--

  - Package things up in R/Python

--

  - Data science workflow

--

- Other topics that are not covered

--

  - [Git (version control)](https://swcarpentry.github.io/git-novice/01-basics/index.html)

--

  - [SQL](https://www.w3schools.com/sql/sql_intro.asp)

--

  - [Blogging](https://bookdown.org/yihui/blogdown/)

--

  - Communication

--

  - Much more...

???
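---

# Packaging Sketch

- One possible first step toward "packaging things up": move repeated code into a function
- A minimal sketch only; `win_pct_by_season()` and `sample_games` are hypothetical names, not part of uncmbb
- Uses base R so it runs without extra packages; the six results are the UNC rows printed on the Data Example slide

```r
# Hypothetical helper (not part of uncmbb): win percentage per season.
# Assumes `games` has columns `Season` and `Result` ("W"/"L"),
# like the `unc` and `duke` data frames.
win_pct_by_season <- function(games, since = 2005) {
  stopifnot(all(c("Season", "Result") %in% names(games)))
  recent <- games[games$Season >= since, , drop = FALSE]
  # mean() of a logical vector is the share of TRUE values (wins / games)
  pct <- tapply(recent$Result == "W", recent$Season, mean)
  data.frame(Season = as.integer(names(pct)),
             win_pct = as.numeric(pct))
}

# Small illustration: results of the last six UNC games shown earlier
sample_games <- data.frame(
  Season = rep(2020, 6),
  Result = c("W", "W", "W", "L", "W", "L")
)
win_pct_by_season(sample_games)
```

- In a package, a function like this would sit under `R/` with its documentation next to it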
---

# Links

- [Good Enough Practices in Scientific Computing](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005510)
- [Carpentries Lesson on Shell](http://swcarpentry.github.io/shell-novice/)
- [Happy Git and GitHub for the useR](https://happygitwithr.com/)
- [Data Science at the Command Line](https://www.datascienceatthecommandline.com/2e/)
- [Editor War](https://en.wikipedia.org/wiki/Editor_war)
- [Data Organization in Spreadsheets](https://www.tandfonline.com/doi/full/10.1080/00031305.2017.1375989)
- [R for Data Science](https://r4ds.had.co.nz/)
- [What They Forgot To Teach You About R](https://rstats.wtf/index.html)
- [R Graphics Cookbook](https://r-graphics.org/)
- [Project-Oriented Workflow](https://www.tidyverse.org/blog/2017/12/workflow-vs-script/)
- [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/)
- [Anaconda Data Science Toolkit](https://www.anaconda.com/products/individual)
- [Why Jupyter Is Data Scientists’ Computational Notebook of Choice](https://www.nature.com/articles/d41586-018-07196-1)
- [The First Notebook War](https://yihui.org/en/2018/09/notebook-war/#what-do-notebooks-and-spreadsheets-have-in-common)
- [I Don't Like Notebooks](https://www.youtube.com/watch?v=7jiPeIFXb6U)

---
class: inverse, center, middle

# Questions?

---
class: right, clear
background-image: url(data:image/png;base64,#../img/cat.jpg)
background-size: contain
background-position: left

<br><br>

.font200[Thank You!]

In the future, if anything in this talk ends up helping you in any way, please reach out and let me know!

[<svg viewBox="0 0 16 16" xmlns="http://www.w3.org/2000/svg" style="height:1em;fill:currentColor;position:relative;display:inline-block;top:.1em;"> <path fill-rule="evenodd" d="M1.75 2A1.75 1.75 0 000 3.75v.736a.75.75 0 000 .027v7.737C0 13.216.784 14 1.75 14h12.5A1.75 1.75 0 0016 12.25v-8.5A1.75 1.75 0 0014.25 2H1.75zM14.5 4.07v-.32a.25.25 0 00-.25-.25H1.75a.25.25 0 00-.25.25v.32L8 7.88l6.5-3.81zm-13 1.74v6.441c0 .138.112.25.25.25h12.5a.25.25 0 00.25-.25V5.809L8.38 9.397a.75.75 0 01-.76 0L1.5 5.809z"></path></svg>](mailto:uncmbbtrivia@gmail.com) [<svg viewBox="0 0 512 512" xmlns="http://www.w3.org/2000/svg" style="height:1em;fill:currentColor;position:relative;display:inline-block;top:.1em;"> [ comment ] <path d="M326.612 185.391c59.747 59.809 58.927 155.698.36 214.59-.11.12-.24.25-.36.37l-67.2 67.2c-59.27 59.27-155.699 59.262-214.96 0-59.27-59.26-59.27-155.7 0-214.96l37.106-37.106c9.84-9.84 26.786-3.3 27.294 10.606.648 17.722 3.826 35.527 9.69 52.721 1.986 5.822.567 12.262-3.783 16.612l-13.087 13.087c-28.026 28.026-28.905 73.66-1.155 101.96 28.024 28.579 74.086 28.749 102.325.51l67.2-67.19c28.191-28.191 28.073-73.757 0-101.83-3.701-3.694-7.429-6.564-10.341-8.569a16.037 16.037 0 0 1-6.947-12.606c-.396-10.567 3.348-21.456 11.698-29.806l21.054-21.055c5.521-5.521 14.182-6.199 20.584-1.731a152.482 152.482 0 0 1 20.522 17.197zM467.547 44.449c-59.261-59.262-155.69-59.27-214.96 0l-67.2 67.2c-.12.12-.25.25-.36.37-58.566 58.892-59.387 154.781.36 214.59a152.454 152.454 0 0 0 20.521 17.196c6.402 4.468 15.064 3.789 20.584-1.731l21.054-21.055c8.35-8.35 12.094-19.239 11.698-29.806a16.037 16.037 0 0 0-6.947-12.606c-2.912-2.005-6.64-4.875-10.341-8.569-28.073-28.073-28.191-73.639 0-101.83l67.2-67.19c28.239-28.239 74.3-28.069 102.325.51 27.75 28.3 26.872 73.934-1.155 101.96l-13.087 13.087c-4.35 4.35-5.769 10.79-3.783 16.612 5.864 17.194 9.042 34.999 9.69 52.721.509 13.906 17.454 20.446 27.294 10.606l37.106-37.106c59.271-59.259 
59.271-155.699.001-214.959z"></path></svg>](https://joongsup.rbind.io) [<svg viewBox="0 0 448 512" xmlns="http://www.w3.org/2000/svg" style="height:1em;fill:currentColor;position:relative;display:inline-block;top:.1em;"> [ comment ] <path d="M416 32H31.9C14.3 32 0 46.5 0 64.3v383.4C0 465.5 14.3 480 31.9 480H416c17.6 0 32-14.5 32-32.3V64.3c0-17.8-14.4-32.3-32-32.3zM135.4 416H69V202.2h66.5V416zm-33.2-243c-21.3 0-38.5-17.3-38.5-38.5S80.9 96 102.2 96c21.2 0 38.5 17.3 38.5 38.5 0 21.3-17.2 38.5-38.5 38.5zm282.1 243h-66.4V312c0-24.8-.5-56.7-34.5-56.7-34.6 0-39.9 27-39.9 54.9V416h-66.4V202.2h63.7v29.2h.9c8.9-16.8 30.6-34.5 62.9-34.5 67.2 0 79.7 44.3 79.7 101.9V416z"></path></svg>](https://www.linkedin.com/in/joongsupjaylee/) [<svg viewBox="0 0 496 512" xmlns="http://www.w3.org/2000/svg" style="height:1em;fill:currentColor;position:relative;display:inline-block;top:.1em;"> [ comment ] <path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 
4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"></path></svg>](https://github.com/joongsup/ksu-seminar) For now, please let me know how the presentation was by filling out the survey below! [<svg viewBox="0 0 384 512" xmlns="http://www.w3.org/2000/svg" style="height:1em;fill:currentColor;position:relative;display:inline-block;top:.1em;"> [ comment ] <path d="M384 112v352c0 26.51-21.49 48-48 48H48c-26.51 0-48-21.49-48-48V112c0-26.51 21.49-48 48-48h80c0-35.29 28.71-64 64-64s64 28.71 64 64h80c26.51 0 48 21.49 48 48zM192 40c-13.255 0-24 10.745-24 24s10.745 24 24 24 24-10.745 24-24-10.745-24-24-24m96 114v-20a6 6 0 0 0-6-6H102a6 6 0 0 0-6 6v20a6 6 0 0 0 6 6h180a6 6 0 0 0 6-6z"></path></svg>](https://www.surveymonkey.com/r/LNSVJ79)